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Abstract 

We apply the splitting method to three well-known counting problems, 
namely 3-SAT, random graphs with prescribed degrees, and binary contingency 
tables. We present an enhanced version of the splitting method based on the 
capture-recapture technique, and show by experiments the superiority of this 
technique for SAT problems in terms of variance of the associated estimators, 
and speed of the algorithms. 
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1 Introduction 

In this paper we apply the splitting method introduced in [5] to a variety of counting 
problems in ^P-complete. Formally, given any decision problem in the class NP, 
e.g. the satisfiability problem (SAT), one can formulate the corresponding counting 
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problem which asks for the total number of solutions for a given instance of the 
problem. In the case of the SAT problem, this corresponding counting problem 
has complexity #SAT. Generally, the complexity class #P consists of the counting 
problems associated with the decision problems in NP. Clearly, a #P problem is 
at least as hard as its corresponding NP problem. In this paper we consider #P- 
complete problems. Completeness is defined similarly as for the decision problems: 
a problem is #P-complete if it is in #P, and if every #P problem can be reduced 
to it in polynomial counting reduction. This means that exact solutions to these 
problems cannot be obtained in polynomial time, and accordingly, our study focuses 
on approximation algorithms. For more background on the complexity theory of 
problems we refer to |13j . 

The proposed splitting algorithm for approximate counting is a randomized one. 
It is based on designing a sequential sampling plan, with a view to decomposing 
a "difficult" counting problem defined on some set X* into a number of "easy" 
ones associated with a sequence of related sets Xq, Xi, . . . , Xm and such that Xm = 
X* . Splitting algorithms explore the connection between counting and sampling 
problems, in particular the reduction from approximate counting of a discrete set to 
approximate sampling of elements of this set, with the sampling performed, typically, 
by some Markov chain Monte Carlo method. 

Recently, counting problems have attracted research interest, notably the so- 
called model counting or ^^SAT, i.e. computing the number of models for a given 
propositional formula Although it has been shown that many solution tech- 

niques for SAT problems can be adapted for these problems, yet due to the exponen- 
tial increase in memory usage and running times of these methods, their application 
area in counting is limited. This drawback motivated the approximative approach 
mentioned earlier. There are two main heuristic algorithms for approximate count- 
ing methods in #SAT. The first one, called ApproxCount, is introduced by Wei and 
Selman in [T7]. It is a local search method that uses Markov Chain Monte Carlo 
(MCMC) sampling to compute an approximation of the true model count of a given 
formula. It is fast and has been shown to provide good estimates for feasible so- 
lution counts, but, in contrast with our proposed splitting method, there are no 
guarantees as to the uniformity of the MCMC samples. Gogate and Dechter [9] 
recently proposed a second model counting technique called SampleMinisat, which 
is based on sampling from the so-called backtrack-free search space of a Boolean 
formula through SampleSearch. An approximation of the search tree thus found is 
used as the importance sampling density instead of the uniform distribution over all 
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solutions. Experiments with SampleMinisat show that it is very fast and typically 
it provides very good estimates. 

The splitting method discussed in this work for counting in deterministic prob- 
lems is based on its classic counterpart for efficient estimation of rare-event probabili- 
ties in stochastic problems. The relation between rare-event simulation methods and 
approximate counting methods have also been discussed, for instance, by Blanchet 
and Rudoy [2j, Botev and Kroese and Rubinstein jT4]; see also [151 Chapter 9]. 

As said, we propose to apply the sequential sampling method presented in [5] 
which yields a product estimator for counting the number of solutions \X*\, where 
the product is taken over the estimators of the consecutive conditional probabilities, 
each of which represents an "easy" problem. In addition, we shall consider an 
alternative version, in which we use the generated samples after the last iteration of 
the splitting algoritm as a sample for the capture-recapure method. This method 
gives us an alternative estimate of the counting problem. Furthermore, we shall 
study an extended version of the capture-recapture method when the problem size is 
too large for the splitting method to give reliable estimates. The idea is to decrease 
artificially the problem size and then apply a backwards estimation. Whenever 
applicable, the estimators associated with our proposed enhancements outperform 
the splitting estimators in terms of variance. 

The paper is organized as follows. We first start with describing the splitting 
method in detail in Section [2j Section [3] deals with the combination of the classic 
capture-recapture method with the splitting algorithm. Finally, numerical results 
and concluding remarks are presented in Sections H] and O respectively. 

2 Splitting Algorithms for Counting 

The splitting method is one of the main techniques for the efficient estimation of 
rare-event probabilities in stochastic problems. The method is based on the idea 
of restarting the simulation in certain states of the system in order to obtain more 
occurrences of the rare event. Although the method originated as a rare event 
simulation technique (see [I], [6], [7], [8], [11], [l2]), it has been modified in [2], [3], 
and [13], for counting and combinatorial optimization problems. 

Consider a NP decision problem with solution set X*, i.e., the set containing 
all solutions to the problem. We are interested to compute the size of the 
solution set. Suppose that there is a larger set X D X* which can be represented 
by a simple description or formula; specifically, its size \X\ is known and easy to 
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compute. We call X the state space of the problem. Denote hy p = \X* \ / \X\ the 
fraction (or "probability") of the solution set w.r.t. the state space. Since \X\ is 
known, it suffices to compute p. In most cases p is extremely small, in other words 
we deal with a rare-event probability. However, assuming we can estimate p by p, 
we obtain automatically 

1^1 = \X\p 

as an estimator of \X*\. Note that straightforward simulation based on generation 
of i.i.d. uniform samples € <Y and delivering the Monte Carlo estimator pmc = 
J! X^i^i ^{Xiex*} an unbiased estimator of fails when p is a rare-event 

probability. To be more specific, assume a parametrization of the decision problem. 
The size of the state space |Af | is parameterized by n, such that \X\ — t- oo as n — > oo. 
For instance, in SAT n represents the number of variables. Furthermore we assume 
that the fraction of the solution set p — >■ as n — >■ oo. The required sample size 
to obtain a relative accuracy e of the 95% confidence interval by the Monte Carlo 
estimation method is [H Chapter 6] 

1.96^ 



e'^p 



which increases like p~^ as n — > cxd. 

The purpose of the splitting method is to estimate p more efficiently via the 
following steps: 

1. Find a sequence of sets X = Xq, Xi, . . . , such that <Yo D D • • • D Xj^ = X*. 

2. Write \X*\ = \Xm\ as the telescoping product 

thus the target probability becomes a product p = YltLi ct, with ratio factors 

Xf , , 

= ^TT^. (2) 

3. Develop an efficient estimator q for each q and estimate \X*\ by 

m 

i=\X^\ = \X^\f,= \X^\^ct. (3) 

i=l 

It is readily seen that in order to obtain a meaningful estimator of \X*\^ we have to 
solve the following two major problems: 
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(i). Put the counting problem into the framework ([T|) by making sure that 



Xq::) XiZ) Xra = x\ (4) 

such that each q is not a rare-event probability. 

(ii). Obtain a low- variance estimator q of each ratio q. 

To this end, we propose an adaptive version of the splitting method. As a demon- 
stration, consider a specific family of decision problems, namely those whose solution 
set is finite and given by linear integer constraints. In other words, X* C is given 
by 

X>j = l ^ij'^j ~ i = 1, . . . , TTT-i; 

Z]j=i '^ij^j ^ i = mi + I, . . . ,mi + m2 = m] (5) 

G {0, 1,... Vj = l,...,n. 

Our goal is to count the number of feasible solutions (or points) to the set ([5]). 
Note that we assume that we know, or can compute easily, the bounding finite set 
<Y = {0, 1 . . . , d}", with points x = (xi, . . . , x„) (in this case \X\ = {d + 1)") as well 
for other counting problems. 

Below we follow [H]. Define the Boolean functions Cj : <Y ^ {0, 1} (i = 1, . . . , m) 

by 

i-^IEJ'.i a,jx,>b,}^ i = mi + l,...,mi + m2. 
Furthermore, define the function S : X ^ Z+ by counting how many constraints 
are satisfied by a point x X, i.e., S{x) = Ci{x). Now we can formulate the 

counting problem as a probabilistic problem of evaluating 

p = Ef [l{SiX)=m}] , (7) 

where X is a random point on X, uniformly distributed with probability density 
function (pdf) f{x), denoted by X ~ / = U{X). Consider an increasing sequence 
of thresholds = mo < mi < • • • < mx-i < mx = m, and define the sequence of 
decreasing sets (jl]) by 

Xt = {xeX : S{x) > mt}. 

Note that in this way 

Xt = {xe Xt-i : S{x) > mt}, 
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for t = 1,2, . . .. The latter representation is most useful since it shows that the ratio 
factor ct in ([2]) can be considered as a conditional expectation: 

where X ~ = U{Xt-i). Note that gt-i{x) is also obtained as a conditional pdf 
by 



(9) 

To draw samples from the uniform pdf <^t-i = i^i^t-i) on a complex set given 
implicitly, one applies typically MCMC methods. For further details we refer to 



2.1 The Basic Adaptive Splitting Algorithm 

We describe here the adaptive splitting algorithm from j^. The thresholds {mt) are 
not given in advance, but determined adaptively via a simulation process. Hence, 
the number T of thresholds becomes a random variable. In fact, the (mt)-thresholds 
should satisfy the requirements q = l/jA't-.i | ?a pt, where the parameters pt G 
(0, 1) are not too small, say pt > 0.01, and set in advance. We call these the splitting 
control parameters. In most applications we chose these all equal, that is pt = p. 

Consider a sample set = {Xi,...,XAr} of N random points in Xt-\. 

That is, all these points are uniformly distributed on Xi^\. Let mf be the (1 — pt„i)- 
th quantile of the ordered statistics values of the scores S{X.\), . . . , ^(Xjv). The elite 
set C consists of those points of the sample set for which S'(Xj) > m^. 

Let Ni be the size of the elite set. If all scores S'(Xj) would be distinct, it follows 
that the number of elites Nt = \Npt-i\, where [•] denotes rounding to the largest 
integer. However, dealing with a discrete space, typically we will find more samples 
with S{Xi) > rrit. All these are added to the elite set. Finally we remark that from 
([9]) it easily follows that the elite points are distributed uniformly on Xt- 

Having an elite set in Xt, we do two things. First, we screen out (delete) du- 

(s) 

plicates, so that we end up with a set of size Nl of distinct elites. Secondly, each 
screened elite is the starting point of a Markov chain simulation (MCMC method) 
on Xt using a transition probability matrix Pt with gt = hl{Xt) as its stationary dis- 
tribution. Because the starting point is uniformly distributed, all consecutive points 
on the sample path are uniformly distributed on Xt- Therefore, we may use all these 
points in the next iteration. 
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Suppose that each sample path has length bt = [N/N^ \, then we get a total of 
ht < N uniform points in Xt- To continue with the next iteration again with 

(s) 

a sample set of size N , we choose randomly N — bt of these sample paths and 
extend them by one point. Denote the new sample set by [X]t, and repeat the same 
procedure as above. The algorithm iterates until we find mt = m, say at iteration 
T, at which stage we stop and deliver 

_ T 

\X*\ = \X\llct (10) 

t=i 

as an estimator of \X*\, where Cf = Nt/N in iteration t. 

In our experiments we applied a Gibbs sampler to implement the MCMC simulation 
for obtaining uniformly distributed samples. To summarize, we give the algorithm. 

Algorithm 2.1 (Basic splitting algorithm for counting). 

1. Set a counter t = 1. Generate a sample set [X]q of N points uniformly distributed 
in Xq. Compute the threshold mi, and determine the size Ni of the elite set. Set 
ci = Ni/N as an estimator of ci = \Xi\/\Xq\. 

(s) 

2. Screen out the elite set to obtain distinct points uniformly distributed in Xt. 

3. Let bt = yN/N^''^ \. For all i = 1,2, .. . ,Ni^\ starting at the i-th screened elite 
point run a Markov chain of length bt on Xt with gt = U{Xt) as its stationary 

(s) 

distribution. Extend N — bt randomly chosen sample paths with one point. 
Denote the new sample set of size N by [X]t. 

4- Increase the counter t = t+1. Compute the threshold mt, and determine the size 
Nt of the elite set. Set ct = Nt/N as an estimator of ct = \Xt\/\Xt-i\. 

5. If mt = m deliver the estimator (jlOp ; otherwise repeat from step 2. 

3 Combining Splitting and Capture— Recapture 

In this section we discuss how to combine the well known capture-recapture (CAP- 
RECAP) method with the basic splitting Algorithm 12.11 First we present the clas- 
sical capture-cecapture algorithm in the literature. 
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3.1 The Classic Capture— Recapture in the Literature 

Originally the capture-recapture method was used to estimate the size, say M, of an 
unknown population on the basis of two independent samples from it. To see how 
the CAP-RECAP method works, consider an urn model with a total of M identical 
balls. Denote by A'^i and the sample sizes taken at the first and second draws, 
respectively. Assume in addition that 

• The second draw takes place after all A^i balls have been returned to the urn. 

• Before returning the A'^i balls, each is marked, say we painted them a different 
color. 

Denote by R the number of balls from the first draw that reappear in the second. 
Then an (biased) estimate M of M becomes 

R 

This is based on the observation that N2/M ~ R/Ni. Note that the name capture- 
recapture was borrowed from a problem of estimating the animal population size in 
a particular area on the basis of two visits. In this case R denotes the number of 
animals captured on the first visit and recaptured on the second. 
A slightly less biased estimator of M is 

M = l^^L±l)(^_i. (n) 

(R+l) ^ ^ 

See |16j for an analysis of the bias and for the derivation of an approximate unbiased 
estimator of the variance of M: 



E 



{Ni + l)(iV2 + l)(iVi - R)iN2 - R) 



Var(M). (12) 



(i? + l)2(i? + 2) 

3.2 SpHtting algorithm combined with Capture— Recapture 

Application of the CAP-RECAP to counting problems is trivial. We set = M 
and note that A''i and A''2 correspond to the screened-out samples at the first and 
second draws, which are performed after Algorithm 12.11 reaches the desired level m. 
Note that we need to remove duplicate samples because these do not occur in the 
capture-recapture method. 

As an example, let us assume that we run the splitting algorithm 12.11 till its 
last step T with N = 10, 000. After reaching the desired level m, we draw two 
independent sets of magnitude A'^i = 5000 and N2 = 5010 and assume that the 
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number of solutions that appeared in both draws in 10, i.e. R = 10. The CAP- 
RECAP estimator of denoted by |Af*|j,j^p is therefore 

I A^|,^p = 2, 505, 000. 

Our numerical results in Section H] clearly indicate that the CAP-RECAP estimator 
is typically more accurate than the product estimator (|10p . that is 

Yar[\X^\^J<Yar[\X^\], 

provided the sample is limited, say by 10,000 and \X*\ is large but also limited, 
say by 10^. 

We make a distinction for larger solution sets: if 10^ < < 10^, we apply an 
extended version of the capture-recapture method, as we will describe in the next 
section. If l^"*! is even larger {\X*\ > 10^), we can estimate it with the crude Monte 
Carlo. 

3.3 Extended Capture— Recapture Method 

Recall that the regular CAP-RECAP method 

1. Is implemented at the last iteration T of the splitting algorithm, that is when 
some configurations have already reached the desired set X*. 

2. It provides reliable estimators of \X*\ if it is not too large, say l^^"*! < 10^. 

In typical rare events counting problems, like SAT \X* \ is indeed < 10^, nevertheless 
we present below an extended CAP-RECAP version, which extends the original 
CAP-RECAP for 2-3 orders more, that is it provides reliable counting estimators 
for 10^ <\X*\< 10^. 

If not stated otherwise we shall have in mind a SAT problem. The enhanced 
CAP-RECAP algorithm involves additional constraints (clauses) and can be written 
as follows. 

Algorithm 3.1 (Extended CAP-RECAP). As soon as all m clauses Ci, . . . ,Cm 
of Xm have been reached by the splitting algorithm and it occurs that the resulting 
product estimator \Xm\ of \Xm\ is larger than > 10^ proceed as follows: 

1. Generate a sample Xi, . . . ,Xnx^ of uniformly distributed points in the desired 
problem set Xm by adding one by one some arbitrary auxiliary clauses until for 
some r we have that 

Nv 

— ■ " — '^m-^T 
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Here Cm+r is a relatively small number, fixed in advance, say 10^^ < Cm+r ^ 
10^^; furthermore, Nx^ CLnd Nx^^^ represent the respective number of points 
generated at X^n cind accepted at Xm+r- Note that the estimate cJ^J+^ is obtained 
as in Step 4 of the basic splitting algorithm \2.1\ 



2. Estimate \X*\ = \Xm\ by 

I '^"1 1 ecap ~ C-m+T ' I '^m+r I cap- ^^^) 

We call l-^mlecap extended CAP-RECAP estimator. It is essential to bear in 
mind that 

• I -^m+r I cap ^ CAP-RECAP estimator rather than a splitting (product) one. 

• I -^m I ecap does not contain the original estimators ci, . . . ,ar generated by the 
splitting method. 

• Since we only need here the uniformity of the samples at Xm-, we can run 
the splitting method of Section 12.11 all the way with relatively small values of 
sample size A'^ and splitting control parameter p until it reaches the vicinity of 
Xm denoted by X^-n where r is a small integer say, r = 1 or r = 2; and then 
switch to larger N and p. 

• In contrast to the splitting estimator which employs a product of T terms, 
formula (jl4p employs only a single c factor. Recall that this additional c^-|--7- 
factor allows to enlarge the CAP-RECAP estimators of \Xm\ for about two- 
three additional orders, namely from \Xm\ ~ 10^ to \Xm\ ~ 10^. 

4 Numerical Results 

Below we present numerical results with the splitting algorithm for counting. In 
particular we consider the following problems: 

1. The 3-satisfiability problem (3-SAT) 

2. Graphs with prescribed degrees 

3. Contingency tables 

For the 3-SAT problem we shall also use the the CAP-RECAP method. We shall 
show that typically CAP-RECAP outperforms the splitting algorithm. We shall use 
the following notations. 
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Notation A. For iteration t = 1, 2, . . . 



• Nt and N^''^ denote the actual number of elites and the number after screening, 
respectively; 

• TTij and m*t denote the upper and the lower elite levels reached, respectively 
(the m*t levels are the same as the rrit levels in the description of the algorithm) ; 

• pt is the splitting control parameter (we chose pt = p); 

• ct = Nt/N is the estimator of the t-th conditional probability; 

• product estimator = \'^\Y\l=iCi. 

4.1 The 3-Satisfiability Problem (3-SAT) 

There are tti clauses of length 3 taken from ii boolean (or binary) variables xi, • • • , 
A literal of the j-th variable is either TRUE (xj = 1) or FALSE {xj = <^ xj = 
1, where xj = NOT(xj)). A clause is a disjunction of literals. We assume that 
all clauses consist of 3 literals. The 3-SAT problem is defined as the problem of 
determining if the variables x = {xi, . . . can be assigned in a such way as to 
make all clauses TRUE. More formally, let X = {0, l}*^ be the set of all configurations 
of the n variables, and let Ci : X {0, 1}, be the m clauses. Then define (j) : X 
{0, 1} by 

m 

ct>{x) = l\Ci{x). 

i=l 

The original 3-SAT problem is to find a configuration of the Xj variables for which 
(j){x) = 1. In this work we are interested in the total number of such configurations 
(or feasible solutions). Then as discussed in Section [21 X* denotes the set of feasible 
solutions. Trivially, there are \X\ =2" configurations. 

The 3-SAT problems can also be converted into the family of decision problems 
([5|) given in Section [2l Define the m x n matrix A with entries aij € {—1, 0, 1} by 

— 1 if Xj G Ci , 

if Xj Ci and xj Ci, 

1 if Xj £ Ci. 

Furthermore, let b be the m- (column) vector with entries bi = 1 — \{j : aij = — 1}|. 
Then it is easy to see that for any configuration x S {0, 1}" 

X e X* 6(x) = 1 ^ Ax > b. 
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Below we compare the efficiencies of the classic and the extended CAP-RECAP 
with their splitting counterpart, bearing in mind that the extended CAP-RECAP 
version is used for larger values of then the classic one. As an example we 
consider the estimation of [A^*! for the 3-SAT problem with an instance matrix A 
of dimension (122 x 515), meaning n = 122, m = 515. In particular Table [J presents 
the the performance of the splitting Algorithm 12.11 based on 10 independent runs 
using = 25,000 and p = 0.1, while Table [2] shoews the dynamics of a run of the 
Algorithm 12.11 for the same data. 

Table 1: Performance of splitting algorithm for the 3-SAT (122 x 515) model with 
N = 25,000 and p = 0.1. 



Run 


nr. of its. 


\X*\ 


CPU 


1 


33 


1.41E+06 


212.32 


2 


33 


l.lOE+06 


213.21 


3 


33 


1.68E+06 


214.05 


4 


33 


1.21E+06 


215.5 


5 


33 


1.21E+06 


214.15 


6 


33 


1.47E+06 


216.05 


7 


33 


1.50E+06 


252.25 


8 


33 


1.73E+06 


243.26 


9 


33 


1.21E+06 


238.63 


10 


33 


1.88E+06 


224.36 


Average 


33 


1.44E+06 


224.38 



The relative error, denoted by RE is 1.815E' — 01. Notice that the relative error 
of a random variable Z is calculated by the standard formula, namely 

RE = S/£, 

where 
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Table 2: Dynamics of a run of the splitting algorithm for the 3-SAT (122 x 515) 
model using = 25, 000 and p = 0.1. 



t 


1^1 


Nt 


Nl ^ 


m* 


m^t 


Ct 


1 


6 


53E+35 


3069 


3069 


480 


460 


1 


23E- 


01 


2 


8 


78E+34 


3364 


3364 


483 


467 


1 


35E- 


01 


3 


1 


15E+34 


3270 


3270 


484 


472 


1 


31E- 


01 


4 


1 


50E+33 


3269 


3269 


489 


476 


1 


31E- 


01 


5 


2 


49E+32 


4151 


4151 


490 


479 


1 


66E- 


01 


6 


3 


37E+31 


3379 


3379 


492 


482 


1 


35E- 


01 


7 


3 


41E+30 


2527 


2527 


494 


485 


1 


OlE- 


01 


8 


6 


19E+29 


4538 


4538 


495 


487 


1 


82E- 


01 


9 


9 


85E+28 


3981 


3981 


497 


489 


1 


59E- 


01 


10 


1 


31E+28 


3316 


3316 


498 


491 


1 


33E- 


01 


11 


1 


46E+27 


2797 


2797 


501 


493 


1 


12E- 


01 


12 


4 


61E+26 


7884 


7884 


501 


494 


3 


15E- 


01 


13 


1 


36E+26 


7380 


7380 


501 


495 


2 


95E- 


01 


14 


3 


89E+25 


7150 


7150 


502 


496 


2 


86E- 


01 


15 


1 


06E+25 


6782 


6782 


505 


497 


2 


71E- 


01 


16 


2 


69E+24 


6364 


6364 


503 


498 


2 


55E- 


01 


17 


6 


42E+23 


5969 


5969 


504 


499 


2 


39E- 


01 


18 


1 


42E+23 


5525 


5525 


506 


500 


2 


21E- 


01 


19 


3 


03E+22 


5333 


5333 


505 


501 


2 


13E- 


01 


20 


5 


87E+21 


4850 


4850 


506 


502 


1 


94E- 


01 


91 


1 


06E+21 






ou / 


OUo 


1 


80E- 


01 


22 


1 


71E+20 


4061 


4061 


507 


504 


1 


62E- 


01 


23 


2 


50E+19 


3647 


3647 


509 


505 


1 


46E- 


01 


24 


3 


26E+18 


3260 


3260 


510 


506 


1 


30E- 


01 


25 


3 


62E+17 


2778 


2778 


510 


507 


1 


llE- 


01 


26 


3 


68E+16 


2539 


2539 


510 


508 


1 


02E- 


01 


27 


3 


05E+15 


2070 


2070 


511 


509 


8 


28E- 


02 


28 


2 


17E+14 


1782 


1782 


512 


510 


7 


13E- 


02 


29 


1 


21E+13 


1398 


1398 


513 


511 


5 


59E- 


02 


30 


5 


OOE+11 


1030 


1030 


513 


512 


4 


12E- 


02 


31 


1 


49E+10 


743 


743 


514 


513 


2 


97E- 


02 


32 


2 


39E+08 


402 


402 


515 


514 


1 


61E- 


02 


33 


1 


43E+06 


150 


150 


515 


515 


6 


OOE- 


03 



We increased the sample size at the last two iterations from = 25, 000 to 
= 100, 000 to get a more accurate estimator. 

As can be seen from Table [H the estimator > 10^, hence for this instance the 
extended CAP-RECAP Algorithm 13.11 can also be used. We shall show that the 
relative error (RE) of the extended CAP-RECAP estimator l^^mlecap ^^^^ than 
that of Before doing so we need to find the extended 3-SAT instance matrix 

(122 X 515) -|- r, where r is the number of auxiliary clauses. Applying the extended 
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CAP-RECAP Algorithm 13.11 we found that r = 5 and thus the extended instance 
matrix is (122 x 520). Recall that the cardinality |<^m+T| of the extended (122 x 520) 
model should be manageable by the regular CAP-RECAP, that is we assumed that 
|A:'m+T| < 10^. Indeed, TableOpresents the performance of the regular CAP-RECAP 
for that extended (122 x 520) model. Here we used again p = 0.1. As for the sample 
size, we took = 1, 000 until iteration 28 and then switched to = 100, 000. The 
final CAP-RECAP estimator is obtained by taking two equal samples, each of size 
N = 70,000 at the final subset Xm+T = '^520- (The sample sizes that were used in 
the estimation are smaller due to the screening step.) 

Table 3: Performance of the regular CAP-RECAP for the extended (122 x 520) 
model. 



Run 


nr. of its. 


1^1 

1 leap 


CPU 


1 


34 


5.53E+04 


159.05 


2 


35 


5.49E+04 


174.46 


3 


35 


5.51E+04 


178.08 


4 


34 


5.51E+04 


166.36 


5 


34 


5.52E+04 


159.36 


6 


33 


5.52E+04 


152.38 


7 


33 


5.54E+04 


137.96 


8 


34 


5.50E+04 


157.37 


9 


35 


5.51E+04 


179.08 


10 


34 


5.51E+04 


163.7 


Average 


34.1 


5.51E+04 


162.78 



The relative error of |'^*lcap o^^r 10 runs is 2.600i? — 03. 

Next we compare the efficiency of the regular CAP-RECAP (as per Table [3]) with 
that of the splitting algorithm for the extended (122 x 520) model. Table H] presents 
the performance of splitting for p = 0.1 and N = 100,000. It readily follows that 
the relative error of the regular CAP-RECAP is about 30 times less than that of 
splitting. Notice in addition that the CPU time of CAP-RECAP is about 6 times 
less than that of splitting. This is so since the total sample size of the former is about 
6 time less than of the latter. Thus the overall speed up obtained by CAP-RECAP 
is about 5, 000 times. 
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Table 4: Performance of splitting algorithm for the 3-SAT (122 x 520) model. 



Run 


nr. of its. 


\X*\ 


CPU 


1 


34 


6.03E+04 


900.28 


2 


34 


7.48E+04 


904.23 


3 


34 


4.50E+04 


913.31 


4 


34 


5.99E+04 


912.27 


5 


34 


6.03E+04 


910.44 


6 


33 


4.94E+04 


898.91 


7 


34 


5.22E+04 


931.88 


8 


34 


5.74E+04 


916.8 


9 


34 


5.85E+04 


919.63 


10 


34 


5.72E+04 


927.7 


Average 


33.9 


5.75E+04 


913.54 



The relative error of \X*\ over 10 rmis is 1.315-E — 01. 

With these results at hand we can proceed with the extended CAP-RECAP and 
compare its efficiency with splitting (see Tabled]) for the instance matrix (122 x 515). 
Table [5] presents the performance of the extended CAP-RECAP estimator |<Y*|ecap 
for the (122 x 515) model along with the performance of the regular CAP-RECAP 
estimator |<^*lcap t^i^ (122 x 520) model (see also the results of Table [3] for 
l'^*lcap)- again p = 0.1. Regarding the sample size we took N = 1,000 for 

the first 31 iterations and then switched to = 100, 000 until reaching the level 
m = 515. Recall that the level m -\- t = 520 and the corresponding CAP-RECAP 
estimator |'Y*|(,ap was obtained from the set Xm = -^515 by adding r = 5 more 
auxiliary clauses. Note that in this case we used for |<Y*|cap two equal samples each 
of length N = 100, 000. 

Comparing the results of Table [1] with that of Table [5] it is readily seen that 
the extended CAP-RECAP estimator |A'*|gj,j^p outperforms the splitting one \X* \ in 
both RE and CPU time. In particular, we have that both RE and CPU times of the 
former are about 1.6 times less than of the latter. This means that the overall speed 
up obtained by |<^*|g(.ap versus \X*\ is about 1,6^ • 1.6 s^:: 4 times. Note finally that 
the total number of samples used in the extended CAP-RECAP estimator |'^*lccap 
is about = 500, 000, while in its counterpart - the splitting estimator \X* \ is about 
N = 50, 000 * 36 = 1, 800, 000. 
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Table 5: Performance of the extended CAP-RECAP estimator |A'*|g^j^p for the (122 x 
515) model along with the regular CAP-RECAP one l^l^ap for the (122 x 520) 
model. 



Run 


nr. its. 




l'^*lcap 


ccap 


CPU 


1 


33 


3.13E-02 


5.41E+04 


1.73E+06 


138.99 


2 


34 


3.47E-02 


5.51E+04 


1.59E+06 


154.64 


3 


34 


3.55E-02 


5.52E+04 


1.55E+06 


161.78 


4 


33 


4.51E-02 


5.40E+04 


1.20E+06 


163.53 


5 


34 


3.04E-02 


5.13E+04 


1.69E+06 


143.84 


6 


34 


2.99E-02 


5.41E+04 


1.81E+06 


151.1 


7 


34 


4.27E-02 


5.51E+04 


1.29E+06 


174.08 


8 


34 


3.87E-02 


5.42E+04 


1.40E+06 


143.27 


9 


33 


3.27E-02 


5.42E+04 


1.66E+06 


171.07 


10 


34 


4.22E-02 


5.51E+04 


1.30E+06 


154.71 


Average 


33.7 


3.63E-02 


5.42E+04 


1.52E+06 


155.70 



The relative error of |'^*|cap over 10 runs is 2.010i? — 02. 
The relative error of |'^*lecap ^^^^ runs is 1.315£' — 01. 

4.2 Random graphs with prescribed degrees 

Random graphs with given vertex degrees have attained attention as a model for real- 
world complex networks, including World Wide Web, social networks and biological 
networks. The problem is basically finding a graph G = {V, E) with n vertices, given 
the degree sequence d = {di, . . . ,dn) formed of nonnegative integers. Following [3], 
a finite sequence {di, . . . ,dn) of nonnegative integers is called graphical if there is a 
labeled simple graph with vertex set {1, . . . , n} in which vertex i has degree di. Such 
a graph is called a realization of the degree sequence (di, . . . , d„). We are interested 
in the total number of realizations for a given degree sequence, hence X* denotes 
the set of all graphs G = {V, E) with the degree sequence (di, . . . , d„). 

Similar to ([5]) for SAT we convert the problem into a counting problem. To pro- 
ceed consider the complete graph Kn of n vertices, in which each vertex is connected 
with all other vertices. Thus the total number of edges in Kn is m = n{n — l)/2, 
labeled ei, . . . ,6^- The random graph problem with prescribed degrees is trans- 
lated to the problem of choosing those edges of Kn such that the resulting graph G 
matches the given sequence d. Set Xi = 1 when Cj is chosen, and = otherwise, 
i = 1, . . . ,m. In order that such an assignment x G {0, 1}"* matches the given de- 
gree sequence {di, . . . , dn), it holds necessarily that X^JLi xj = ^ Yl^=i ^i, since this 
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is the total number of edges. In other words, the configuration space is 



m n 



X = I X £ {0,1}"" -.^Xj = ^^di 



Let A be the incidence matrix of Kn with entries 




Vi ^ Cj 

1 Vi & ej . 



It is easy to see that whenever a configuration x G {0, l}™ satisfies Ax = d, the 
associated graph has degree sequence (di, . . . , d^). We conclude that the problem 
set is represented by 



Wc first present a small example as illustration. Let d = (2,2,2, 1,3) with n = 5, 
and m = 10. After ordering the edges of lexicographically, the corresponding 
incidence matrix is given as 



It is readily seen that the following x = (0, 0, 1, 1, 1, 0, 1, 0, 1, 0)', and x = (1, 0, 0, 1, 1, 0, 0, 0, 1, 1)' 
present two solutions of this example. 

For the random graph problem we define the score function S : X ^ Z_ by 



where deg{vi) is the degree of vertex i under the configuration x. Each configuration 
that satisfies the degree sequence will have a performance function equal to 0. 

The implementation of the Gibbs sampler for this problem is slightly different 
than for the 3-SAT problem, since we keep the number of edges in each realization 
fixed to Yldi/2. Our first algorithm takes care of this requirement and generates a 
random x E X. 

Algorithm 4.1. Let (cii, . . . ,d„) be the prescribed degrees sequence. 
• Generate a random permutation of 1, ... ,m. 



X* = {xeX ■.Ax = d}. 



A = 



/l 1 1 1 0\ 

1 1 1 1 

1 1 1 1 

1 1 1 1 

\0 00100101 1/ 



n 



1=1 
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• Choose the first places in this permutation and deliver a vector x having 

one's in those places. 

The adaptive thresholds in the basic sphtting algorithm are negative, increasing 
to 0: 

mi < 7712 < • • • < rriT-i < mx = 0. 

The resulting Gibbs sampler (in Step 3 of the basic splitting algorithm starting with 
a configuration x ^ X for which S{x) > mt) can be written as follows. 

Algorithm 4.2 (Gibbs Algorithm for random graph sampling). For each edge Xi = 
1, while keeping all other edges fixed, do: 

1. Remove Xi from x, i.e. make xi = 0. 

2. Check all possible placements for the edge resulting a new vector x conditioning 
on the performance function S{x) > mt 

3. With uniform probability choose one of the valid realizations. 

We will apply the splitting algorithm to two problems taken from [3]. 

4.2.1 A small problem 

For this small problem we have the degree sequence 

d=(5,6,l^_^). 

11 ones 

The solution can be obtained analytically and already given in f3]: 

"To count the number of labeled graphs with this degree sequence, note 
that there are ( g^) = 462 such graphs with vertex 1 not joined to vertex 
2 by an edge (these graphs look like two separate stars), and there are 
(V) (D ~ 6930 such graphs with an edge between vertices 1 and 2 (these 
look like two joined stars with an isolated edge left over). Thus, the total 
number of realizations of d is 7392." 

As we can see from Table [H the algorithm easily handles the problem. Table [7] 
presents the typical dynamics. 
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Table 6: Performance of the splitting algorithm for a small problem using N = 
50,000 and p = 0.5. 



Run 


nr. of its. 


1^1 


CPU 


1 


10 


7146.2 


15.723 


2 


10 


7169.2 


15.251 


3 


10 


7468.7 


15.664 


4 


10 


7145.9 


15.453 


5 


10 


7583 


15.555 


6 


10 


7206.4 


15.454 


7 


10 


7079.3 


15.495 


8 


10 


7545.1 


15.347 


9 


10 


7597.2 


15.836 


10 


10 


7181.2 


15.612 


Average 


10 


7312.2 


15.539 



The relative error of over 10 runs is 2.710ii^ — 02. 

Table 7: Typical dynamics of the splitting algorithm for a small problem using 
N = 50,000 and p = 0.5 (recall Notation [A] at the beginning of Section H]). 



t 


1^1 


Nt 






m^t 




1 


4.55E+12 


29227 


29227 


-4 


-30 


0.5845 


2 


2.56E+12 


28144 


28144 


-4 


-18 


0.5629 


3 


1.09E+12 


21227 


21227 


-6 


-16 


0.4245 


4 


3.38E+11 


15565 


15565 


-4 


-14 


0.3113 


5 


7.51E+10 


11104 


11104 


-4 


-12 


0.2221 


6 


l.llE+10 


7408 


7408 


-2 


-10 


0.1482 


7 


1.03E+09 


4628 


4628 


-2 


-8 


0.0926 


8 


5.37E+07 


2608 


2608 


-2 


-6 


0.0522 


9 


1.26E+06 


1175 


1175 





-4 


0.0235 


10 


7223.9 


286 


280 





-2 


0.0057 



4.2.2 A large problem 

A much harder instance (see [3j) is defined by 

d = (7, 8, 5, 1, 1, 2, 8, 10, 4, 2, 4, 5, 3, 6, 7, 3, 2, 7, 6, 1,2, 9, 6, 1, 3, 4, 6, 3, 3, 3, 2, 4, 4). 

In [3] the number of such graphs is estimated to be about 1.533 x 10^' Table [8] 
presents 10 runs using the splitting algorithm. 
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Tabic 8: Performance of the splitting alj 
100, 000 and p = 0.5. 



[gorithm for a large problem using N = 



Riiii 


nr. its. 


|,V"'| 


CPU 


1 


39 


1.66E+57 


4295 


2 


39 


1.58E+57 


4223 


3 


39 


1.58E+57 


4116 


4 


39 


1.53E+57 


4281 


5 


39 


1.76E+57 


4301 


6 


39 


1.75E+57 


4094 


7 


39 


1.46E+57 


4512 


8 


39 


1.71E+57 


4287 


9 


39 


1.39E+57 


4158 


10 


39 


1.38E+57 


4264 


Average 


39 


1.58E+57 


4253 



The relative error of lA"*! over 10 runs is 8A3QE — 02. 
4.3 BinEiry Contingency Tables 

Given are two vectors of positive integers r = (ri, . . . , r^) and c = (ci, . . . , c„) such 
that ri < n for all i, Cj < n for all j, and YliLi = Sj=i '^j- ^ binary contingency 
table with row sums r and column sums c is a m x n matrix X of zero-one entries 
Xij satisfying ^"=1 Xij = ri for every row i and Yl^i ^ij — every column j. 

The problem is to count all contingency tables. 

The extension of the proposed Gibbs sampler for counting the contingency tables 
is straightforward. We define the configuration space X = X^-'^ U A'^'^) as the space 
where all column or row sums are satisfied: 



A'(^) = \xe{0, 1}-+" : J2 ^ij = ^0 Vi 



1=1 

n 

XG{0,ir+":^a;i,=riVi 



Clearly we can sample uniformly at random from X without any problem. The score 
function 5 : A* — ^ Z_ is defined by 



S{X) 



-T.T=i\Trj=iXij-ri\, for XGA'(^), 
-Y.U\T.T=i^i3 - for XGA'W, 
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that is, the difference of the row sums Y^j=i xij with the target rj if the column 
sums are right, and vice versa. 

The Gibbs sampler is very similar to the one in the previous section concerning 
random graphs with prescribed degrees. 

Algorithm 4.3 (Gibbs algorithm for random contingency tables sampling). Given 
a matrix realization X € A'^'^^ with score S{X) > rrit. For each column j and for 
each 1-entry in this column (xij = I) do: 

1. Remove this 1, i.e. set x[j = 0. 

2. Check all possible placements for this 1 in the given column j conditioning on the 
performance function S{X') > mt (X' is the matrix resulting by setting x[j = 0, 
x'-,j = 1 for some Xi'j = 0, and all other entries remain unchanged). 

3. Suppose that the set of valid realization is A = {X'\S{X') > nif}. (Please note 
that this set also contains the original realization X ). Than with probability 
pick any realization at random and continue with step 1. 

Note that in this way we keep the column sums correct. Similarly, when we 
started with a matrix configuration with all row sums correct, we execute these 
steps for each row and swap 1 and per row. 

4.3.1 Model 1 

The date are m = 12,n = 12 with row and column sums 

r = (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2), c = (2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2). 

The true count value is known to be 21,959,547,410,077,200. Table [9] presents 10 
runs using the splitting algorithm. Table [TOl presents a typical dynamics. 



21 



Table 9: Performance of the splitting algorithm for Model 1 using A'" = 50, 000 and 
p = 0.5. 



Run 


nr. its. 




CPU 


1 


7 


2.15E+16 


4.54 


2 


7 


2.32E+16 


4.55 


3 


7 


2.23E+16 


4.54 


4 


7 


2.11E+16 


4.58 


5 


7 


2.05E+16 


4.57 


6 


7 


2.23E+16 


4.54 


7 


7 


2.02E+16 


4.55 


8 


7 


2.38E+16 


4.58 


9 


7 


2.06E+16 


4.57 


10 


7 


2.14E+16 


4.55 


Average 


7 


2.17E+16 


4.56 



The relative error of \X*\ over 10 runs is 5.210ii^ — 02. 

Table 10: Typical dynamics of the splitting algorithm for Model 1 using N = 50, 000 
and p = 0.5. 



t 


m 


Nt 










1 


4.56E+21 


13361 


13361 


-2 


-24 


0.6681 


2 


2.68E+21 


11747 


11747 


-2 


-12 


0.5874 


3 


l.lOE+21 


8234 


8234 


-2 


-10 


0.4117 


4 


2.76E+20 


5003 


5003 


-2 


-8 


0.2502 


5 


3.45E+19 


2497 


2497 





-6 


0.1249 


6 


1.92E+18 


1112 


1112 





-4 


0.0556 


7 


2.08E+16 


217 


217 





-2 


0.0109 



4.3.2 Model 2 

Darwin's Finch Data from Yuguo Chen, Persi Diaconis, Susan P. Holmes, and Jun 
S. Liu: m = 12, n = 17 with row and columns sums 

r = (14, 13, 14, 10, 12, 2, 10, 1, 10, 11, 6, 2), c = (3, 3, 10, 9, 9, 7, 8, 9, 7, 8, 2, 9, 3, 6, 8, 2, 2). 

The true count value is known to be 67, 149, 106, 137, 567, 600. Table [TT] presents 10 
runs using the splitting algorithm. 
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Table 11: Performance of the splitting algorithm for Model 2 using N = 200,000 
and p = 0.5. 



Run 


nr. its. 




CTU 


1 


24 


6.16E+16 


246.83 


2 


24 


6.50E+16 


244.42 


3 


24 


7.07E+16 


252.71 


4 


24 


7.91E+16 


247.36 


5 


24 


6.61E+16 


260.99 


6 


24 


6.77E+16 


264.07 


7 


24 


6.59E+16 


269.86 


8 


24 


6.51E+16 


273.51 


9 


24 


7.10E+16 


272.49 


10 


24 


5.91E+16 


267.23 


Average 


24 


6.71E+16 


259.95 



The relative error of over 10 runs is 7.850£? — 02. 

5 Concluding Remarks 

In this paper we applied the splitting method to several well-known counting prob- 
lems, like 3-SAT, random graphs with prescribed degrees and binary contingency 
tables. While implementing the splitting algorithm, wc discussed several MCMC 
algorithms and in particular the Gibbs sampler. Wc show how to incorporate the 
classic capture-recapture method in the splitting algorithm in order to obtain a low 
variance estimator for the desired counting quantity. Furthermore, we presented an 
extended version of the capture-recapture algorithm, which is suitable for problems 
with a larger number of feasible solutions. We finally presented numerical results 
with the splitting and capture-recapture estimators, and showed the superiority of 
the latter. 
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